Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
Motivation
Rates of depression appear to have been increasing among American youths since around 2010, according to a recent report. A recent study also shows that youths appear to be seeking more care from mental health services.
This case study will explore how rates of major depressive episodes have changed since the early 2000s, as well as how rates of treatment for depression among youths have changed over time.
Photo by K. Mitch Hodge on Unsplash
The major symptoms of a major depressive episode include:
Sleep disorder (increased or decreased)
Interest deficit (anhedonia)
Guilt (worthlessness, hopelessness, regret)
Energy deficit
Concentration deficit
Appetite disorder (increased or decreased)
Psychomotor retardation or agitation
Suicidality

Diagnostic requirements for a major depressive episode according to the DSM-5:
A. Five or more of the following symptoms have been present and documented during the same two-week period and represent a change from previous functioning; at least one of the symptoms is either (1) depressed mood or (2) loss of interest or pleasure.
Note: Do not include symptoms that are clearly attributable to another medical condition.
Depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g., feels sad, empty, hopeless) or observation made by others (e.g., appears tearful)
Markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly every day (as indicated by either subjective account or observation)
Significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body weight in a month), or decrease or increase in appetite nearly every day
Insomnia or hypersomnia nearly every day
Psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)
Fatigue or loss of energy nearly every day
Feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly every day (not merely self-reproach or guilt about being sick)
Diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)
Recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific plan, or a suicide attempt or a specific plan for committing suicide
B. The symptoms do not meet criteria for a mixed episode.
C. The episode is not attributable to the physiological effects of a substance or to another medical condition.
Note: Criteria A-C represent a major depressive episode.
Note: Responses to a significant loss (e.g., bereavement, financial ruin, losses from a natural disaster, a serious medical illness or disability) may include feelings of intense sadness, rumination about the loss, insomnia, poor appetite and weight loss noted in Criterion A, which may resemble a depressive episode. Although such symptoms may be understandable or considered appropriate to the loss, the presence of a major depressive episode in addition to the normal response to a significant loss should also be carefully considered. This decision inevitably requires the exercise of clinical judgment based on the individual’s history of and the cultural norms for the expression of distress in the context of loss.
D. The occurrence of the major depressive episode is not better explained by schizoaffective disorder, schizophrenia, schizophreniform disorder, delusional disorder, or other specified and unspecified schizophrenia spectrum and other psychotic disorders.
E. There has never been a manic episode or a hypomanic episode.
Note: This exclusion does not apply if all of the manic-like or hypomanic-like episodes are substance-induced or are attributable to the physiological effects of another medical condition.
This case study is motivated by the following two articles:
Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. J Abnorm Psychol.128,3 (2019):185-199. doi:10.1037/abn0000410
Olfson M, Blanco C, Wang S, Laje G, Correll CU. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. JAMA Psychiatry. 71,1 (2014):81-90. doi:10.1001/jamapsychiatry.2013.3074
The main findings of the first article are:
Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.
Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.
Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.
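Note that the percentage increases quoted above are relative changes, not differences in percentage points. A minimal sketch of the arithmetic in R, using the rates reported in the first finding (the helper name `relative_increase` is ours, introduced only for illustration):

```r
# Relative (percent) change between two rates; hypothetical helper name
relative_increase <- function(from, to) {
  (to - from) / from * 100
}

# Adolescents aged 12 to 17: 8.7% (2005) to 13.2% (2017)
adolescents <- relative_increase(from = 8.7, to = 13.2)

# Young adults aged 18 to 25: 8.1% (2009) to 13.2% (2017)
young_adults <- relative_increase(from = 8.1, to = 13.2)

# Rounded to whole percentages, as reported in the article
round(c(adolescents = adolescents, young_adults = young_adults))
```

In absolute terms, the same changes correspond to 4.5 and 5.1 percentage points, respectively.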
The main findings of the second article are:
Compared with adult mental health care, the mental health care of young people has increased more rapidly.
Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for adults (from 23.23 to 28.48 visits) (interaction: P < .001).
Psychiatrist visits also increased significantly faster for youths (from 2.86 to 5.71 visits).
Again, while depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.
In this case study, we will use data from the National Survey on Drug Use and Health (NSDUH) on treatment and major depressive episode rates to explore how these have changed over time and how different groups compare. This data was also used in the first referenced article.
Main Questions
Our main questions:
- How have depression rates in American youth changed since 2004, according to the NSDUH data? How have rates differed between different youth subgroups (age, gender, ethnicity)?
- Do mental health services appear to be reaching more youths? Again, how have rates differed between different youth subgroups (age, gender, ethnicity)?
Learning Objectives
Statistical Learning Objectives:
- Define and create a contingency table.
- Implement a chi-squared test for independence.
- Interpret a chi-squared test for independence.
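As a preview of these objectives, here is a minimal sketch in base R. The counts are made-up placeholders, not NSDUH results; they only illustrate the shape of a contingency table and the call to `chisq.test()`.

```r
# Hypothetical counts (NOT survey results):
# rows are groups, columns are outcomes
counts <- matrix(
  c(30, 70,
    45, 55),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("group_a", "group_b"),
                  outcome = c("yes", "no"))
)
counts

# Chi-squared test for independence between group and outcome
# (R applies Yates' continuity correction to 2 x 2 tables by default)
result <- chisq.test(counts)

# Counts expected if group and outcome were independent
result$expected

# p-value of the test
result$p.value
```

A small p-value would suggest that the outcome is not independent of the group, subject to the usual assumptions of the test.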
Data science Learning Objectives:
- Scrape data directly from a website (rvest).
- Subset and filter data (dplyr).
- Write functions to wrangle data repetitively.
- Work with character strings (stringr).
- Reshape data into different formats (tidyr).
- Data visualizations (ggplot2) with labels (directlabels) and facets for different groups.
In this case study, we will especially focus on using packages and functions from the tidyverse, such as rvest. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

We will begin by loading the packages that we will need:
| Package | Use |
|---|---|
| here | to easily load and save data |
| rvest | to scrape web pages |
| dplyr | to subset and filter data |
| magrittr | to pipe sequential commands |
| stringr | to manipulate strings |
| tidyr | to change the shape or format of tibbles to wide and long |
| tibble | to create tibbles and convert values of a column to row names |
| ggplot2 | to create plots |
| directlabels | to add labels directly to lines in plots |
| forcats | to reorder factors for plots |
| cowplot | to combine plots together |
The first time we use a function, we will use the :: operator to indicate which package it comes from. Unless function names overlap between packages, this is not necessary, but we include it here to show where each function comes from.
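A minimal sketch of this setup step, assuming the packages listed above have already been installed (for example with `install.packages()`). The loop attaches only the packages that are found and reports any that are missing. The `::` example uses `stats::median()` from base R purely for illustration.

```r
# Packages listed above
pkgs <- c("here", "rvest", "dplyr", "magrittr", "stringr", "tidyr",
          "tibble", "ggplot2", "directlabels", "forcats", "cowplot")

# Check which packages are installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed packages
for (pkg in pkgs[installed]) {
  library(pkg, character.only = TRUE)
}

# Report any packages that still need to be installed
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# package::function() names the package a function comes from,
# and works even when the package has not been attached
stats::median(c(1, 5, 9))
```

In an interactive session, it is more common to write one `library()` call per package, such as `library(rvest)`.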
Context
According to the CDC, the rate of suicide has also increased for most age groups in the United States over the past decade and a half.

While suicide does appear to be increasing among youths, it also appears to be increasing among middle-aged adults, for both females and males.

According to the CDC:
Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the second leading cause of death among those aged 10–34 and the fourth leading cause among those aged 35–54.

So although suicide is on the rise for most age groups, it is one of the top two causes of death for youths.
This warrants further examination of the mental health of American youths.

Historically, suicide rates were much higher before 1950; however, we are seeing an increase in the last 20 years.

If you are in crisis and need help, call this toll-free number for the National Suicide Prevention Lifeline (NSPL), available 24 hours a day, every day: 1-800-273-TALK (8255). The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at www.suicidepreventionlifeline.org.
The Crisis Text Line is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to 741741 and a trained crisis counselor will respond to you with support and information over text message. Visit www.crisistextline.org.
Also see here for more information about how to recognize and help youths experiencing symptoms of depression.
Limitations
There are some important considerations regarding this data analysis to keep in mind:
The data that we will use come from a survey and are therefore values from a sample that estimate those of the true population. In our statistical analysis, we use these sample values as if they were the population values (because this is all we have access to). Thus, our results are not necessarily indicative of true differences.
Furthermore, the sampling mechanism utilized can introduce selection bias in cases where the sampling methods do not produce a representative sample.
Data is collected from human participants; this presents the potential for information bias, as participants may, for a variety of reasons, report inaccurate information.
What are the data?
We will be using data from the National Survey on Drug Use and Health (NSDUH), which is directed by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).
This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (age 12 and up) are interviewed each year about health-related issues. Only civilian, non-institutionalized individuals are included. Households are randomly selected, and then a professional interviewer visits the addresses and asks one or two of the residents to be interviewed. The interviewer brings a laptop that the participants use to fill out the survey, which typically takes an hour to complete. Participants who choose to take part receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy, particularly focused on drug and alcohol use as well as mental health. See here for more details about the survey.
This data is made available publicly online on the Substance Abuse & Mental Health Data Archive.

At the website for the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.

If one clicks on the TOC button in the upper right corner, they will be directed to another website, where a large PDF document containing all of the results can be downloaded.
We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus, the following tables are of interest to us:
| Table | Description |
|---|---|
| Table 11.1A | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018 |
| Table 11.1B | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018 |
| Table 11.2A | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.2B | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018 |
| Table 11.3A | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018 |
| Table 11.3B | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018 |
| Table 11.4A | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.4B | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018 |
According to the NSDUH 2018 report:
Respondents were defined as having had an MDE in the past 12 months if they had at least one period of 2 weeks or longer in the past year when they experienced a depressed mood or loss of interest or pleasure in daily activities, accompanied by problems with sleeping, eating, energy, concentration, or self-worth. The MDE questions are based on diagnostic criteria from DSM-5. Some of the wordings of the depression questions for adolescents aged 12 to 17 and adults aged 18 or older differed slightly to make the questions more developmentally appropriate for adolescents.
Adolescents were defined as having an MDE with severe impairment if their depression caused severe problems with their ability to do chores at home, do well at work or school, get along with their family, or have a social life.
Data Import
Data is often made available online. Usually, the data we are interested in can be downloaded from the page as a delimited text file or an Excel file. However, sometimes data is not made available in this manner, as is the case for the NSDUH survey data.
How do we proceed in this scenario?
We could manually copy each cell of data; however, this process is often inefficient, subject to error, and not reproducible. Suppose we wanted to run the analysis again next year on the next year's data, and it happened to be formatted in the same way: we would have to repeat all of the manual copying.
We can also use R for web scraping.
Web scraping is the process of extracting data from a website.
Basic steps of web scraping
There are two main steps to web scraping:
1. Identify location of data on the webpage that will be scraped
2. Save the webpage element to an object
We accomplish STEP 1 with our web browser.
We accomplish STEP 2 in the R programming environment.
Additional resources for web scraping:
In this case study, we will scrape data from the tables on the NSDUH survey website. This data is also available in a large PDF with all the results from the year. However, this PDF is not easy to find, and it would be difficult and time-consuming to locate our tables of interest and extract the data from the PDF with pdftools. Again, if we instead copied and pasted the data from the website into another file, which we would then also need to import, the process would be less efficient, less reproducible, and more prone to errors.
Instead, we will use the rvest package to scrape the data directly from the tables on the website. Assuming next year's data is displayed in a similar manner, we could simply update the URL in our code to run the same analysis on the new data.
The rvest package can be thought of as the pdftools package for web scraping. Upon pulling the data, additional wrangling will likely be required; but like the pdftools package, rvest streamlines the extraction process.
Steps for scraping tables
The two web scraping steps for these tables can be broken down even further:
- Identify location of data that will be scraped
  - right-click to inspect element (webpage)
  - hover pointer over components of element (webpage) until the data has been found
  - copy XPath of data sought
- Save webpage element to an object in R
  - import html code for the webpage
  - extract pieces of HTML documents (webpage) using XPath
  - parse the extracted data into a data frame
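The R side of these steps can be sketched with rvest as follows. To keep the example self-contained, it parses a small inline HTML string instead of the NSDUH page; the table contents are placeholders, and the XPath is an assumption standing in for the one copied from the browser. For a live page, the URL would be passed to `read_html()` in place of the string.

```r
library(rvest)

# Stand-in for a web page; the values are placeholders, not NSDUH results
example_html <- '
<html><body>
  <div id="results">
    <table>
      <tr><th>Group</th><th>Year1</th><th>Year2</th></tr>
      <tr><td>A</td><td>1.5</td><td>2.5</td></tr>
      <tr><td>B</td><td>3.5</td><td>4.5</td></tr>
    </table>
  </div>
</body></html>'

# Import the HTML code (read_html() also accepts a URL)
page <- read_html(example_html)

# Extract the element located by the XPath
# (hypothetical XPath written for this toy page)
table_node <- html_nodes(page, xpath = '//*[@id="results"]/table')

# Parse the extracted element into a data frame;
# html_table() returns one table per matched node
scraped <- html_table(table_node)[[1]]
scraped
```

Because only the input to `read_html()` and the XPath change from page to page, the same three calls can be reused for each table of interest.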
Below is an animated overview of the process.
